where GIoU(·) is the generalized intersection over union function [202]. Each $G_i$ reflects the “closeness” of student proposals to the $i$-th ground-truth object. Then, we retain highly qualified student proposals around at least one ground truth to benefit object recognition [235] as:

$$
\tilde{b}^S_j =
\begin{cases}
b^S_j, & \text{GIoU}(b^{GT}_i,\, b^S_j) > \tau G_i,\ \exists\, i,\\
\varnothing, & \text{otherwise},
\end{cases}
\qquad (2.34)
$$
where $\tau$ is a threshold controlling the proportion of distilled queries.
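For illustration, the selection in Eq. (2.34) can be written as a boolean mask over a pairwise GIoU matrix. The following PyTorch sketch uses `generalized_box_iou` from torchvision; the function name, tensor shapes, and the (x1, y1, x2, y2) box format are assumptions of this illustration rather than the authors' implementation.

```python
from torchvision.ops import generalized_box_iou

def select_distilled_proposals(gt_boxes, student_boxes, G, tau=0.5):
    """Keep student proposals close to at least one ground truth (Eq. 2.34).

    gt_boxes:      (M, 4) ground-truth boxes in (x1, y1, x2, y2) format
    student_boxes: (N, 4) student proposal boxes
    G:             (M,)   per-ground-truth "closeness" scores G_i
    tau:           threshold controlling the proportion of distilled queries
    """
    # Pairwise GIoU between every ground truth and every student proposal: (M, N)
    giou = generalized_box_iou(gt_boxes, student_boxes)
    # Proposal j survives if GIoU(b_i^GT, b_j^S) > tau * G_i for at least one i
    keep = (giou > tau * G[:, None]).any(dim=0)  # (N,) boolean mask
    return student_boxes[keep], keep
```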
After removing the object-empty ($\varnothing$) queries, we form a distillation-desired student query set, denoted $\tilde{q}^S$, associated with its object set $\tilde{y}^S = \{\tilde{c}^S_j, \tilde{b}^S_j\}_{j=1}^{\tilde{N}}$. Correspondingly, we can obtain a teacher query set $\tilde{y}^T = \{\tilde{c}^T_j, \tilde{b}^T_j\}_{j=1}^{\tilde{N}}$. For the $j$-th student query, its corresponding teacher query is matched as:
$$
\{\tilde{c}^T_j, \tilde{b}^T_j\} = \operatorname*{arg\,max}_{\{\tilde{c}^T_k,\, \tilde{b}^T_k\}_{k=1}^{N}} \Big[ \mu_1\, \text{GIoU}(\tilde{b}^S_j,\, b^T_k) - \mu_2 \big\| \tilde{b}^S_j - b^T_k \big\|_1 \Big],
\qquad (2.35)
$$
where $\mu_1 = 2$ and $\mu_2 = 5$ control the matching function; these values follow [31].
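Under the same assumptions, the matching of Eq. (2.35) reduces to a row-wise argmax over a pairwise score matrix. A minimal sketch, with illustrative names and shapes:

```python
import torch
from torchvision.ops import generalized_box_iou

def match_teacher_queries(student_boxes, teacher_boxes, mu1=2.0, mu2=5.0):
    """Match each retained student box to the teacher query that maximizes
    mu1 * GIoU - mu2 * L1 (Eq. 2.35); mu1 = 2 and mu2 = 5 follow [31].

    student_boxes: (N_sel, 4) retained student boxes
    teacher_boxes: (N, 4)     teacher boxes
    """
    giou = generalized_box_iou(student_boxes, teacher_boxes)  # (N_sel, N)
    l1 = torch.cdist(student_boxes, teacher_boxes, p=1)       # (N_sel, N) pairwise L1
    score = mu1 * giou - mu2 * l1
    # Index of the best-scoring teacher query for each student query
    return score.argmax(dim=1)
```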
Finally, the upper-level optimization after rectification in Eq. (2.29) becomes:
$$
\min_{\theta}\; H(\tilde{q}^{S*} \,|\, \tilde{q}^T).
\qquad (2.36)
$$
Optimizing Eq. (2.36) is challenging. Alternatively, we minimize the norm distance between $\tilde{q}^{S*}$ and $\tilde{q}^T$, whose optimum, i.e., $\tilde{q}^{S*} = \tilde{q}^T$, is exactly the same as that of Eq. (2.36). Thus, our final distribution rectification distillation loss becomes:
$$
\mathcal{L}_{\text{DRD}}(\tilde{q}^{S*}, \tilde{q}^T) = \mathbb{E}\big[ \| \tilde{D}^{S*} - \tilde{D}^T \|_2 \big],
\qquad (2.37)
$$
where we use the Euclidean distance of the co-attended feature $\tilde{D}$ (see Eq. 2.26), which contains the information of the query $\tilde{q}$, for optimization.
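A minimal sketch of Eq. (2.37), assuming the matched co-attended features arrive as row-aligned tensors and that no gradient should flow into the teacher (the `detach()` is our assumption):

```python
def drd_loss(d_student, d_teacher):
    """Distribution rectification distillation loss (Eq. 2.37).

    d_student: (N_sel, C) co-attended features of the rectified student queries
    d_teacher: (N_sel, C) matched teacher features
    """
    # E[||D_S* - D_T||_2]: per-query Euclidean distance, averaged over queries
    return (d_student - d_teacher.detach()).norm(p=2, dim=-1).mean()
```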
In backward propagation, the gradient update drives the student queries toward their teacher hints, thereby accomplishing the distillation. The overall training loss for our Q-DETR model is:
$$
\mathcal{L} = \mathcal{L}_{GT}(y^{GT}, y^S) + \lambda\, \mathcal{L}_{\text{DRD}}(\tilde{q}^{S*}, \tilde{q}^T),
\qquad (2.38)
$$
where $\mathcal{L}_{GT}$ is the common detection loss for tasks such as proposal classification and coordinate regression [31], and $\lambda$ is a trade-off hyper-parameter.
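Putting the two terms of Eq. (2.38) together, a hypothetical composition might look as follows; `l_gt` stands in for the standard detection criterion [31], and the default value of `lam` is an assumption for illustration, not a setting reported here:

```python
def q_detr_loss(l_gt, d_student, d_teacher, lam=1.0):
    """Overall objective of Eq. (2.38): L = L_GT + lambda * L_DRD.

    l_gt:      scalar detection loss (proposal classification + box regression)
    d_student: (N_sel, C) rectified student co-attended features
    d_teacher: (N_sel, C) matched teacher features
    lam:       trade-off hyper-parameter (illustrative default)
    """
    l_drd = (d_student - d_teacher.detach()).norm(p=2, dim=-1).mean()  # Eq. (2.37)
    return l_gt + lam * l_drd
```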
2.4.5 Ablation Study
Datasets. We first conduct ablation studies and hyper-parameter selection on the PASCAL VOC dataset [62], which contains natural images from 20 different classes. We use the VOC trainval2012 and VOC trainval2007 sets, approximately 16k images in total, to train our model, and the VOC test2007 set, which contains 4952 images, to evaluate our Q-DETR. We report COCO-style metrics for the VOC dataset: AP, AP50 (the default VOC metric), and AP75. We further conduct experiments on the COCO 2017 [145] object detection task. Specifically, we train the models on COCO train2017 and evaluate them on COCO val2017. We report the average precision for IoUs ∈ [0.5 : 0.05 : 0.95], designated as AP, using COCO's standard evaluation metric. To further analyze our method, we also report AP50, AP75, APs, APm, and APl.
Implementation Details. Our Q-DETR is trained within the DETR [31] and SMCA-DETR [70] frameworks. We select ResNet-50 [84] as the backbone and modify it with pre-activation structures and the RPReLU [158] function, following [155]. PyTorch [185] is used to implement Q-DETR. We run the experiments on 8 NVIDIA Tesla A100 GPUs with 80 GB